This report explores a dataset[1] containing 4898 white wines with 11 physicochemical variables (inputs) and 1 sensory variable (output). The inputs come from objective tests (e.g. pH values), and the output is the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
## [1] 4898 12
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The dataset consists of 12 numerical variables, with 4898 observations.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
##
## poor ok good
## 183 3655 1060
The distribution of quality looks roughly “normal”. Not surprisingly, wine experts gave OK-but-mediocre scores to most wines, with only a handful rated excellent (9) or poor (3).
The wines are categorized into 3 buckets according to the score. The counts above correspond to “poor” for scores 3 and 4, “ok” for scores 5 and 6, and “good” for scores 7 to 9.
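The bucket boundaries can be checked against the printed counts: 20 + 163 = 183, 1457 + 2198 = 3655 and 880 + 175 + 5 = 1060. Below is a minimal sketch of that binning with base R's `cut()`. It rebuilds the score vector from the frequency table instead of loading the data, and the name `quality.level` is illustrative rather than taken from the report.

```r
# Rebuild the quality scores from the frequency table printed above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Bin into three levels: (0,4] -> poor, (4,6] -> ok, (6,10] -> good.
quality.level <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("poor", "ok", "good"))

table(quality.level)
```

On the real data frame the same call would take `wqw$quality` as its first argument.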
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
I set the binwidth smaller than the default to take a closer, if noisier, look at the data. The fixed.acidity of most wines falls around 6.75, with a few outliers to the right. The volatile.acidity is skewed to the right, with most wines around 0.27. The citric.acid also has a few outliers to the right, and it shows an interesting distribution when the binwidth is set below 0.05 (in the plot, it’s set to 0.02): there are two peaks, at about 0.3 and 0.47.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual.sugar is right-skewed, and the max value (65.8) is far greater than the rest of the observations. The highest values are trimmed in the second plot to reveal more detail. Most wines contain residual sugar at around 1.25. The log-transformed sugar distribution in the third plot appears bimodal, with peaks around 1.25 and 10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Despite the outliers at the far right end, the distribution of chlorides looks almost “normal” too. Most wines contain chlorides between 0.025 and 0.0625. The mean is 0.046 and the median is 0.043.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## 90%
## 0.99815
Since the density of white wines is very close to the density of water (1.000 g/mL at 3.98 °C[2]), I set the binwidth particularly small (0.0005) to get more detail. The plot is a bit skewed. Most wines are “lighter” than water: the 3rd quartile is 0.9961 and the 90th percentile is 0.99815.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
As we know, most wines are acidic, and the plot agrees with that domain knowledge. In this dataset, most samples fall between 3 and 3.6 on the pH scale. The median and the mean are almost the same, around 3.18, and the distribution is mostly symmetric.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The alcohol distribution is skewed, but not extremely so. The mode appears around 9.3, which is lower than the 1st quartile (9.5).
##
## light medium heavy
## 4460 397 41
## wqw$alcohol.level: light
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 5.8 6.0 9.0
## --------------------------------------------------------
## wqw$alcohol.level: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 6.000 7.000 6.668 7.000 9.000
## --------------------------------------------------------
## wqw$alcohol.level: heavy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 6.000 7.000 6.659 7.000 8.000
I added a new variable, alcohol.level, that categorizes wines as “light”, “medium” or “heavy” according to their percent alcohol content.
Light-bodied wines (4460) far outnumber full-bodied wines (41).
From the histogram, we notice that the mode of quality is 7 for full-bodied and medium-bodied wines and 6 for light-bodied wines. Looking at the summary, medium- and full-bodied wines score better on average (means of 6.67 and 6.66) than light-bodied ones (5.8). However, the best wines in the dataset (score 9) are light- or medium-bodied.
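The report does not print the alcohol cut points, so the sketch below only illustrates the mechanics of the categorization. The thresholds of 12.5% and 13.5% are assumptions, not values taken from the analysis, and the sample values are a handful of alcohol readings that appear in the summaries above.

```r
# A few alcohol values (% vol) that appear in the printed output.
alcohol <- c(8.8, 9.5, 11.0, 12.9, 14.2)

# ASSUMED cut points (not given in the report):
# below 12.5 -> light, 12.5 up to 13.5 -> medium, 13.5 and above -> heavy.
alcohol.level <- cut(alcohol,
                     breaks = c(-Inf, 12.5, 13.5, Inf),
                     labels = c("light", "medium", "heavy"),
                     right = FALSE)

data.frame(alcohol, alcohol.level)
```

With a factor like this in the data frame, a per-group summary such as the one printed above can be produced with `by(wqw$quality, wqw$alcohol.level, summary)`.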
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
All of the statistics of total.sulfur.dioxide are greater than those of free.sulfur.dioxide, which makes sense since the free form is one component of the total. free.sulfur.dioxide also has a few high outliers, so I trimmed them and zoomed in.
After zooming in, the plot looks quite similar to that of total.sulfur.dioxide, so I speculate that these two variables are highly correlated.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
I wonder how much of the \(SO_2\) exists in bound form in the wine, so I created a new variable by subtracting free.sulfur.dioxide from total.sulfur.dioxide. I’m also interested in how this variable relates to the free form of \(SO_2\).
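The derived variable is a plain element-wise subtraction. A minimal sketch, using the first ten rows of the two columns as they appear in the `str()` output rather than the full data set:

```r
# First ten rows of the two sulfur dioxide columns, copied from the str() output.
free.sulfur.dioxide  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total.sulfur.dioxide <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Bound SO2 is the part of the total that is not free.
bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide
bound.sulfur.dioxide
```

On the full data frame this would presumably read `wqw$bound.sulfur.dioxide <- wqw$total.sulfur.dioxide - wqw$free.sulfur.dioxide`, followed by `summary()` on the new column.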
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Potassium sulphate is a wine additive that contributes to sulfur dioxide gas levels. Most values are below 1.0. The mode appears around 0.46.
There are 4898 observations in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, density, pH, alcohol, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and quality). All the variables are numerical. 11 of them are physicochemical measurements from objective tests; quality is based on sensory data from wine experts.
The main features of interest are quality and alcohol. I’d like to investigate which chemical properties influence how a wine tastes. Other variables, of course, play supporting roles.
Acidity, sugar, chlorides, density and \(SO_2\) are other features I’ll take into account.
I created a new variable by assigning the quality values to a 3-level (“good”, “ok”, “poor”) factor variable.
A similar categorization was applied to the alcohol variable: depending on the alcohol content, the observations were divided into “light”, “medium” and “heavy” groups.
A variable for the bound form of \(SO_2\) was created by subtracting the amount of free \(SO_2\) from the total amount. I’m interested in how this variable correlates with the free form of \(SO_2\), and how it contributes to the wine quality.
I trimmed a few high outliers for residual.sugar, chlorides and free.sulfur.dioxide to zoom in to the majority of the data.
I also log-transformed the right skewed residual.sugar distribution. The transformed distribution appeared bimodal.
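As a rough illustration of why the log transform helps, the sketch below applies `log10()` to the five-number summary of residual.sugar printed earlier and compares how far the maximum sits from the bulk of the data on each scale. Base 10 is an assumption here, since the report does not say which base was used, and the variable names are illustrative.

```r
# Min, 1st quartile, median, 3rd quartile and max of residual.sugar (from summary()).
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Distance from Q3 to the max, measured in interquartile ranges, on the raw scale.
raw_ratio <- (sugar[["max"]] - sugar[["q3"]]) / (sugar[["q3"]] - sugar[["q1"]])

# The same ratio after a log10 transform (the base is an assumption).
log_sugar <- log10(sugar)
log_ratio <- (log_sugar[["max"]] - log_sugar[["q3"]]) /
  (log_sugar[["q3"]] - log_sugar[["q1"]])

round(c(raw = raw_ratio, log10 = log_ratio), 2)
```

The long right tail that dominates the raw histogram shrinks to roughly one interquartile range on the log scale, which is what lets the two peaks show up.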
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## bound.sulfur.dioxide 0.13566071 0.15676923 0.102179337
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## bound.sulfur.dioxide 0.34484449 0.19379550 0.2635372837
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## bound.sulfur.dioxide 0.922482350 0.50444690 0.0031433874
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
## bound.sulfur.dioxide 0.13569394 -0.42692304 -0.217867760
## bound.sulfur.dioxide
## fixed.acidity 0.135660713
## volatile.acidity 0.156769227
## citric.acid 0.102179337
## residual.sugar 0.344844495
## chlorides 0.193795498
## free.sulfur.dioxide 0.263537284
## total.sulfur.dioxide 0.922482350
## density 0.504446902
## pH 0.003143387
## sulphates 0.135693943
## alcohol -0.426923036
## quality -0.217867760
## bound.sulfur.dioxide 1.000000000
There isn’t any variable that is strongly correlated with quality. Alcohol has the largest correlation with quality, and it is only moderate (r = 0.44). Alcohol also has a strong negative correlation with density (r = -0.78). This makes sense: density is affected by sugar and ethanol, and since ethanol is “lighter” (0.789 g/cm³) than water, more alcohol leads to lower density.
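To make the ranking explicit, the quality column of the correlation matrix can be sorted by absolute value. The sketch below copies that column from the output above; the names `r_quality` and `ranked` are illustrative.

```r
# Correlations with quality, copied from the quality column of the matrix above.
r_quality <- c(
  fixed.acidity        = -0.113662831,
  volatile.acidity     = -0.194722969,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.097576829,
  chlorides            = -0.209934411,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.174737218,
  density              = -0.307123313,
  pH                   =  0.099427246,
  sulphates            =  0.053677877,
  alcohol              =  0.435574715,
  bound.sulfur.dioxide = -0.217867760
)

# Order by absolute strength, keeping the sign for interpretation.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
round(ranked, 3)
```

Alcohol comes first, followed by density, bound.sulfur.dioxide and chlorides, which are the variables the rest of the analysis concentrates on.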
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
The first plot shows 7 vertical strips. Transparency, jitter and a conditional mean of alcohol are added to deal with the overplotting. The second plot gives us a vague trend. The third figure is a box plot using the new categorical variable, which shows a clearer relationship. Overall, the highest alcohol content tends to go with the highest quality, while the lowest alcohol content accounts for the majority of the mediocre wines.
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.575 7.300 7.600 8.525 11.800
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.800 6.400 6.900 7.129 7.600 10.200
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.500 6.400 6.800 6.934 7.400 10.300
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.838 7.300 14.200
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.200 6.700 6.735 7.200 9.200
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.800 6.657 7.300 8.200
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.60 6.90 7.10 7.42 7.40 9.10
The best wines (score 9) do have a slightly higher median and mean fixed acidity than the mid-range scores, but so do the worst (score 3). Neither the scatter plot nor the box plot gives us a compelling trend.
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1700 0.2375 0.2600 0.3332 0.4125 0.6400
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1100 0.2700 0.3200 0.3812 0.4600 1.1000
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.240 0.280 0.302 0.340 0.905
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.2000 0.2600 0.2774 0.3300 0.6600
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.240 0.260 0.270 0.298 0.360 0.360
Too much acetic acid in wine can lead to an unpleasant, vinegary taste, so I expected this feature to be influential. The poor-quality wines (scores 3 and 4) do have a higher mean volatile acidity. But the best wines (score 9) don’t have the lowest level of acetic acid; the lowest means belong to wines scored 6 and 7.
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2100 0.2575 0.3450 0.3360 0.3850 0.4700
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1900 0.2900 0.3042 0.4000 0.8800
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3377 0.4100 1.0000
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3256 0.3600 0.7400
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.2800 0.3200 0.3265 0.3600 0.7400
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 0.340 0.360 0.386 0.450 0.490
The best wines (score 9) have the highest median and mean levels of citric acid, which brings out the “freshness” and pleasant flavor of wines. The second-poorest group (score 4) has the lowest median level of citric acid. But there isn’t much variation among the rest of the wines.
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.587 4.600 6.393 10.700 16.200
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
The median sugar content jumps up and down across the quality levels, and most of the points are crammed at the bottom of the plot. There isn’t a clear trend describing the relationship between residual sugar and quality.
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
The chlorides variable contains a number of outliers as well, so I added a coord_cartesian layer to zoom in past them. It turns out that the lower the chloride content, the better the quality tends to be.
## [1] 0.9224823
## [1] 0.615501
## [1] 0.2635373
The first plot shows that total.sulfur.dioxide and bound.sulfur.dioxide are strongly linearly correlated (r = 0.92). free.sulfur.dioxide and total.sulfur.dioxide have a weaker linear relationship (r = 0.62). The third plot, of free.sulfur.dioxide against bound.sulfur.dioxide, doesn’t show a strong relationship (r = 0.26).
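Part of the high total/bound correlation is structural: because total is defined as free plus bound, the covariance of total with bound always equals the variance of bound plus the covariance of free with bound. The sketch below checks that identity numerically on the first ten rows from the `str()` output. The correlations it prints come from ten rows only and will not match the full-data values above.

```r
# First ten rows of the sulfur dioxide columns, copied from the str() output.
free  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
bound <- total - free

# cov(total, bound) = cov(free + bound, bound) = var(bound) + cov(free, bound)
lhs <- cov(total, bound)
rhs <- var(bound) + cov(free, bound)
all.equal(lhs, rhs)

# Pairwise Pearson correlations for this small sample.
round(cor(data.frame(free, total, bound)), 2)
```

When bound \(SO_2\) varies much more than free \(SO_2\), the `var(bound)` term dominates and the total tracks the bound form closely.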
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.0 82.5 106.0 117.3 152.2 331.0
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 67.5 102.0 101.9 133.8 195.0
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 91.0 114.0 114.5 137.0 293.5
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.0 76.0 97.0 101.4 123.0 243.0
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 71.00 86.00 90.99 106.00 199.00
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.00 71.00 84.00 89.45 104.50 159.50
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.0 62.0 82.0 82.6 96.0 112.0
Since the correlation between bound.sulfur.dioxide and quality is the strongest among the three sulfur variables (r = -0.22), I only look at the plots for these two. As with chlorides, the less bound \(SO_2\) there is, the better the quality tends to be.
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0001
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0004
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0024
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0004
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0006
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9897 0.9898 0.9903 0.9915 0.9906 0.9970
Density is negatively correlated with alcohol, so, not surprisingly, the best wines have the lowest density. But the highest density doesn’t correspond to the worst quality.
## wqw$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.870 3.035 3.215 3.188 3.325 3.550
## --------------------------------------------------------
## wqw$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.830 3.070 3.160 3.183 3.280 3.720
## --------------------------------------------------------
## wqw$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.790 3.080 3.160 3.169 3.240 3.790
## --------------------------------------------------------
## wqw$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
## --------------------------------------------------------
## wqw$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.214 3.320 3.820
## --------------------------------------------------------
## wqw$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.940 3.120 3.230 3.219 3.330 3.590
## --------------------------------------------------------
## wqw$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 3.280 3.280 3.308 3.370 3.410
The correlation between pH and quality is weak (r = 0.10). But according to the plot, better wines tend to be slightly less acidic overall.
The two plots show how sugar and alcohol affect density: with more residual sugar and less alcohol, the density goes up.
The density also has a moderate positive relationship with the bound form of sulfur dioxide (r = 0.50).
The correlation between alcohol and sugar is negative but only moderate (r = -0.45).
No single feature is strongly correlated with quality. However, among all the features, alcohol has a much larger effect than the others, and it is positively correlated with quality.
The strongest relationship is between bound.sulfur.dioxide and total.sulfur.dioxide (r = 0.92), which means that most of the sulfur dioxide is in the bound form.
The distribution of alcohol (%) is faceted by quality and colored by alcohol level. Mostly, medium- and full-bodied wines fall in the higher quality groups. Each distribution is either skewed or has too few observations. Interestingly, the mode gradually moves from left to right (from right-skewed to left-skewed).
Besides the findings from the last plot, we can see that not only the mode but the whole distribution shifts from right to left.
This nebulous plot depicts the relationship between chlorides and alcohol, colored by quality. Despite a few high-quality wines in the bottom right corner, the plot splits into a top-left region with lots of purple/red dots and a bottom-right region with yellow dots, although the two overlap in the top left corner as well. High-quality wines are mostly high in alcohol and low in chlorides, but not vice versa.
This plot depicts a few interesting relationships. It’s faceted by alcohol level, with density along the y axis. The three clusters move downward from the first facet to the third, as the density of each cluster gets lower. Also, the pH values of the third cluster are more concentrated around 3.3, while those of the other two are much more spread out and centered below 3.3.
High alcohol (%) comes with low chlorides and low residual sugar.
This is a similar plot with a different y axis, but it conveys more information. Sugar and alcohol both contribute to the density: more sugar makes the liquid denser, while alcohol pulls the density down.
## # weights: 49 (36 variable)
## initial value 9531.067910
## iter 10 value 6936.249193
## iter 20 value 5953.931769
## iter 30 value 5651.788336
## iter 40 value 5644.772540
## iter 50 value 5642.097494
## iter 60 value 5639.352159
## iter 70 value 5632.208596
## iter 80 value 5625.493185
## iter 90 value 5624.923045
## iter 100 value 5622.909941
## final value 5622.909941
## stopped after 100 iterations
## Call:
## multinom(formula = factor(quality) ~ alcohol + chlorides + residual.sugar +
## density + pH, data = wqw)
##
## Coefficients:
## (Intercept) alcohol chlorides residual.sugar density pH
## 4 -120.81014 -0.4128723 -11.15200 -0.18646306 132.44834 -0.9107893
## 5 -94.53182 -0.6217789 -11.22721 -0.07452422 109.17594 -0.7565946
## 6 66.00632 0.0018524 -11.88587 0.03593580 -61.66156 0.1050599
## 7 112.85741 0.4084028 -30.63426 0.06898806 -117.16284 1.1912225
## 8 69.44586 0.7429319 -21.97037 0.10047706 -80.46100 1.5105167
## 9 -13.76324 0.8293298 -155.80289 0.07316911 -11.04498 5.8738811
##
## Std. Errors:
## (Intercept) alcohol chlorides residual.sugar density pH
## 4 2.997051 0.2506668 5.8700800 0.05469809 2.990051 1.617535
## 5 2.845250 0.2353793 5.1183617 0.05096326 2.844434 1.541327
## 6 2.827615 0.2333035 5.1462789 0.05085075 2.843158 1.536579
## 7 2.879589 0.2352787 6.4461695 0.05145795 2.846302 1.549026
## 8 3.027010 0.2439436 8.8810602 0.05322446 2.958503 1.615454
## 9 6.608243 0.5258974 0.3584722 0.14876669 6.617603 3.316591
##
## Residual Deviance: 11245.82
## AIC: 11317.82
A multinomial logit regression is run against several variables. Quality score 3 is the reference group, so the other levels are estimated against it, and the coefficients in each row are relative to that group. Based on the coefficients, alcohol and pH play increasingly positive roles as the quality increases, while density and chlorides act negatively. Sugar doesn’t change much across the levels. Note that the optimizer stopped at the 100-iteration limit, so the estimates should be read as approximate.
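One common way to read a table like this, though not something the report itself does, is to divide each coefficient by its standard error to get a Wald z statistic, and to exponentiate the coefficient to get a relative risk ratio against the reference level. The sketch below does this for the alcohol column only, with the numbers copied from the output above; the variable names are illustrative.

```r
# Alcohol coefficients and standard errors for quality levels 4 to 9,
# copied from the multinom() summary above (level 3 is the reference).
coef_alcohol <- c(`4` = -0.4128723, `5` = -0.6217789, `6` = 0.0018524,
                  `7` =  0.4084028, `8` =  0.7429319, `9` = 0.8293298)
se_alcohol   <- c(`4` = 0.2506668, `5` = 0.2353793, `6` = 0.2333035,
                  `7` = 0.2352787, `8` = 0.2439436, `9` = 0.5258974)

z  <- coef_alcohol / se_alcohol   # Wald z statistic
p  <- 2 * pnorm(-abs(z))          # two-sided p-value under a normal approximation
rr <- exp(coef_alcohol)           # relative risk ratio versus quality 3

round(data.frame(coef = coef_alcohol, z = z, p = p, rr = rr), 3)
```

The sign change between levels 5 and 7 matches the reading above. Several of the individual contrasts against score 3 are not clearly distinguishable from zero, which is unsurprising given that only 20 wines have the reference score.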
Alcohol together with chlorides gives a clearer picture for determining quality.
It’s not odd to see higher alcohol lead to lower density, but it’s a bit surprising to see higher alcohol also come with lower chlorides and lower sugar; explaining that may require some knowledge of chemistry and winemaking technology.
Yes. I created a multinomial logit regression to compute the coefficients. The major problem is that many of the variables are correlated. But the other features have so little influence that I kept the correlated ones. The strength of this model is that I can see how the coefficients change across quality levels.
There is an unexpected spike besides the mode. It could be the result of certain winemakers adding more than average citric acid as supplements to boost the acidity.
On average, lower chloride content goes with higher quality. Overall, the best white wines contain the lowest chlorides.
The higher-quality wines tend to have higher alcohol (%) and a lower amount of salt.
I would say alcohol and chlorides influence wine quality the most, although alcohol is correlated with density, so density contributes to the taste as well. Surprisingly, pH also plays a role in determining the quality of white wines: less acidic wines tend to have a nicer flavor.
The biggest struggle is that there isn’t any feature that stands out and boldly answers “who’s in charge”. The output variable should actually be considered categorical, which is different from the course materials, so the data exploration path needs adjusting to cope with this type of data. I also had difficulty picking a reasonable model to complement the analysis.
This project also reminds me that background knowledge and common sense help the EDA process tremendously. The data doesn’t speak for itself. It’s the analyst who interprets the data, bringing reality to it and reflecting it back.
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016 (Elsevier); pre-press PDF: http://www3.dsi.uminho.pt/pcortez/winequality09.pdf; BibTeX: http://www3.dsi.uminho.pt/pcortez/dss09.bib
[2] https://www.sigmaaldrich.com/catalog/product/sial/denwat?lang=en&region=US